The semester project gives you the chance to apply the knowledge and skills you will learn in the course in a way that mimics the ways you can expect to use logistic regression modeling in a real-world setting.
For the project, you will choose one data set from the two listed below and perform a logistic regression analysis. Specifically, you will build a regression model and report on the model statistics and diagnostics. Your final deliverable for the project will be an R Markdown file. The file will contain the analyses detailed in each step with 5-7 written bullet points. Basically, pretend you are presenting to a non-technical audience, and the bullet points would serve as a script outline for your explanation. By non-technical I mean an audience who knows things like “a p-value less than 0.05 mean statistical significance” but cannot really explain the underlying concepts in great detail.
You can choose one of two different data sets to complete the project, either the wine quality data set first analyzed during week 3 or the data set diabetes from the faraway package, which we do not cover elsewhere in the course. Here are the details for each. Remember, you just need to choose one, not both.
wine quality: rather than building a model to discriminate between white and red wine, for the project you may collapse the quality variable into a binary response equal to 1 (high) if quality > 5 and 0 (low) if quality <= 5. Build a logistic regression model to explain what factors are related to a high rating.
diabetes: the help file for the data states that glycosolated hemoglobin (denoted as glyhbin the data) greater than 7.0 is usually taken as a positive diagnosis of diabetes. Thus, collapse glyhb into a binary response equal to 1 if gly > 7 (positive) and 0 if gly <= 7 (negative). Build a logistic regression model to explain what factors are related to a positive diagnosis.
library(tidyverse)
library(ROCR) # <-- for AUC
library(ResourceSelection) # <-- for Hosmer-Lemeshow Goodness of Fit Test
library(splines) # <-- for splines
wine_red <- read_delim('winequality-red.csv',delim = ';') %>%
mutate(quality=factor(ifelse(quality>5,1,0),labels = c('low','high')))
names(wine_red)<-make.names(names(wine_red),unique = TRUE)
I am analyzing the red wine dataset for my final project. This dataset was collected in 2009 and includes objective sensor data and a subjective quality measure about Vinho Verde style red wine from Northern Portugal. The data has 1599 rows and 11 physicochemical predictors and a single quality score from 0-10 which is the median score of at least 3 evaluations collected from wine experts. For simplicity I will be predicting if the score is high (greater than 5) or low (5 or less). The predictors are listed below :
Source for Variable Description
These Predictors will be used with Logistic Regression modeling techniques to predict the subjective quality of the wine.
The project consists of five steps, which you will work on throughout the second half of the semester. The steps you complete each week will accumulate in your R Markdown file, due at the end of the course. You will receive credit for completing drafts of Steps 1- 5 according to the schedule below. Each of these steps are described in greater detail in the week they are due.
Task: Construct interleaved histograms and scatterplots to explore the relation between the predictors and response. Specifically, choose two predictors and make an interleaved histogram and scatterplot for each. Thus, you should construct four total graphs.
ggplot(
data = wine_red,
aes(x=alcohol,fill=quality)
) + geom_histogram(position = 'dodge',bins = 30) +
ggtitle("Histogram of Wine Quality by Alcohol Percent")+
xlab("Alcohol %")
ggplot(
data = wine_red,
aes(x=total.sulfur.dioxide,fill=quality)
) + geom_histogram(position = 'dodge',bins = 30) +
ggtitle("Histogram of Wine Quality by Total Sulfur Dioxide") +
xlab("Total Sulfur Dioxide")
ggplot(
data = wine_red,
aes(x=total.sulfur.dioxide,y=quality,color=quality)
) + geom_jitter(alpha=.5) +
ggtitle("Scatterplot of Wine Quality by Total Sulfur Dioxide") +
xlab("Total Sulfur Dioxide")+
ylab("Quality")
ggplot(
data = wine_red,
aes(x=alcohol,y=quality,color=quality)
) + geom_jitter(alpha=.5) +
ggtitle("Scatterplot of Wine Quality by Alcohol %") +
xlab("Alcohol %")+
ylab("Quality")
Task: Perform variable selection via the AIC using the step() function. Your starting model should include all available predictors. The reduced model should be used as your final model for all subsequent steps. Or, if you disagree with the recommended model, you need to indicate why.
wine_red <- wine_red %>%
mutate(quality=ifelse(quality=='high',1,0))
saturated_model <- glm(quality~.,family=binomial,data=wine_red)
aic_step <- step(saturated_model)
## Start: AIC=1679.63
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar +
## chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + alcohol
##
## Df Deviance AIC
## - pH 1 1655.9 1677.9
## - density 1 1656.0 1678.0
## - residual.sugar 1 1656.7 1678.7
## - fixed.acidity 1 1657.5 1679.5
## <none> 1655.6 1679.6
## - citric.acid 1 1660.8 1682.8
## - chlorides 1 1662.1 1684.1
## - free.sulfur.dioxide 1 1662.9 1684.9
## - total.sulfur.dioxide 1 1690.2 1712.2
## - sulphates 1 1698.8 1720.8
## - volatile.acidity 1 1705.9 1727.9
## - alcohol 1 1730.0 1752.0
##
## Step: AIC=1677.9
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar +
## chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + sulphates + alcohol
##
## Df Deviance AIC
## - density 1 1657.2 1677.2
## - residual.sugar 1 1657.5 1677.5
## <none> 1655.9 1677.9
## - citric.acid 1 1661.2 1681.2
## - chlorides 1 1662.1 1682.1
## - fixed.acidity 1 1662.8 1682.8
## - free.sulfur.dioxide 1 1663.0 1683.0
## - total.sulfur.dioxide 1 1690.7 1710.7
## - sulphates 1 1701.1 1721.1
## - volatile.acidity 1 1706.8 1726.8
## - alcohol 1 1751.2 1771.2
##
## Step: AIC=1677.19
## quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar +
## chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## sulphates + alcohol
##
## Df Deviance AIC
## - residual.sugar 1 1657.8 1675.8
## <none> 1657.2 1677.2
## - citric.acid 1 1662.5 1680.5
## - chlorides 1 1663.1 1681.1
## - fixed.acidity 1 1663.3 1681.3
## - free.sulfur.dioxide 1 1664.2 1682.2
## - total.sulfur.dioxide 1 1691.4 1709.4
## - sulphates 1 1701.1 1719.1
## - volatile.acidity 1 1713.3 1731.3
## - alcohol 1 1839.5 1857.5
##
## Step: AIC=1675.84
## quality ~ fixed.acidity + volatile.acidity + citric.acid + chlorides +
## free.sulfur.dioxide + total.sulfur.dioxide + sulphates +
## alcohol
##
## Df Deviance AIC
## <none> 1657.8 1675.8
## - citric.acid 1 1663.0 1679.0
## - chlorides 1 1663.5 1679.5
## - fixed.acidity 1 1664.2 1680.2
## - free.sulfur.dioxide 1 1665.2 1681.2
## - total.sulfur.dioxide 1 1691.4 1707.4
## - sulphates 1 1701.2 1717.2
## - volatile.acidity 1 1713.4 1729.4
## - alcohol 1 1843.8 1859.8
formula(aic_step)
## quality ~ fixed.acidity + volatile.acidity + citric.acid + chlorides +
## free.sulfur.dioxide + total.sulfur.dioxide + sulphates +
## alcohol
Task: Check that continuous predictors have a linear relation with the logit using loess plots and perform the Hosmer-Lemeshow (HL) goodness of fit test. If the loess plot is nonlinear, then splines should be used to account for the nonlinearity.
loess_plot <- function(column_name){
f <- as.formula(paste("quality", column_name, sep = " ~ "))
y_smooth <- predict(loess(f, data=wine_red))
zero_one <- which(y_smooth>0 & y_smooth<1)
plot(jitter(wine_red[[column_name]])[zero_one],log(y_smooth[zero_one]/(1-y_smooth[zero_one])),
xlab = column_name)
}
loess_plot('alcohol')
The lowess plot for Alcohol clearly shows around 2 splines around 9.5% and 12.5%
loess_plot('volatile.acidity')
The lowess plot for Volatile Acidity does not show a strong need for splines
loess_plot('sulphates')
The lowess plot for Sulphates shows a clear need for a spline around 1
loess_plot('total.sulfur.dioxide')
The lowess plot for Total Sulfur Dioxide shows a potential spline around 40 is needed
loess_plot('free.sulfur.dioxide')
The lowess plot for Free Sulfur Dioxide shows splines around 10 and 15 could improve the model
loess_plot('fixed.acidity')
The lowess plot for Fixed Acidity shows that splines around 8 and 12 are needed
loess_plot('chlorides')
The lowess plot for Chlorides shows a spline around .09 should be added
loess_plot('citric.acid')
The lowess plot for Citric Acid shows that splines around .2 and .5 should be added
spline_formula <- formula(quality~ns(citric.acid,4)+
ns(chlorides,2)+
ns(fixed.acidity,2)+
ns(free.sulfur.dioxide,4)+
ns(total.sulfur.dioxide,3)+
ns(sulphates,4)+
volatile.acidity+
ns(alcohol,4))
step_with_splines_model <- glm(spline_formula,family=binomial,data=wine_red)
hl_test <- hoslem.test(wine_red$quality,aic_step$fitted.values,10)
hl_test
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: wine_red$quality, aic_step$fitted.values
## X-squared = 16.205, df = 8, p-value = 0.03954
hl_test <- hoslem.test(wine_red$quality,step_with_splines_model$fitted.values,10)
hl_test
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: wine_red$quality, step_with_splines_model$fitted.values
## X-squared = 10.983, df = 8, p-value = 0.2026
Most of the Loess Plots show that the predictors are nonlinear except Volatile Acidity and thus splines need to be used to create multiple regression lines for each of the remaining predictors in the model. Another choice could be to select a model like a decision tree or neural network that can model features with nonlinear behaviors
Task: Report p-values and confidence intervals for significant predictors and check for influential observations. Any influential observations should be removed and the model should be refit. Note any changes in the inferences due to the removal.
conf_int <- confint(step_with_splines_model) %>%
data.frame() %>%
rownames_to_column(var="Predictors")
p.values <- step_with_splines_model%>%
summary() %>%
coef()%>%
.[,4] %>%
data.frame()%>%
rownames_to_column(var="Predictors")
colnames(p.values) <- c("Predictors", "p.values")
conf_int %>% left_join(p.values)
## Predictors X2.5.. X97.5.. p.values
## 1 (Intercept) -3.3205881 1.7165029 5.413388e-01
## 2 ns(citric.acid, 4)1 -0.7720941 0.4066690 5.435578e-01
## 3 ns(citric.acid, 4)2 -2.4797959 -0.4478897 4.908975e-03
## 4 ns(citric.acid, 4)3 -2.4027709 0.7404139 2.970513e-01
## 5 ns(citric.acid, 4)4 -2.0003567 3.4452966 6.176041e-01
## 6 ns(chlorides, 2)1 -3.8543374 0.1440308 7.095380e-02
## 7 ns(chlorides, 2)2 -4.6187578 0.9007006 2.135363e-01
## 8 ns(fixed.acidity, 2)1 1.5357446 4.6409308 1.000350e-04
## 9 ns(fixed.acidity, 2)2 -1.4643868 1.1913960 8.224173e-01
## 10 ns(free.sulfur.dioxide, 4)1 -0.5079563 1.6765334 2.911363e-01
## 11 ns(free.sulfur.dioxide, 4)2 -1.1239664 1.0868087 9.704847e-01
## 12 ns(free.sulfur.dioxide, 4)3 -1.8555339 2.9882415 6.371468e-01
## 13 ns(free.sulfur.dioxide, 4)4 -0.3978137 3.9818804 8.596518e-02
## 14 ns(total.sulfur.dioxide, 3)1 -3.4966968 -1.2519440 3.499755e-05
## 15 ns(total.sulfur.dioxide, 3)2 -3.4254130 0.8721582 2.125156e-01
## 16 ns(total.sulfur.dioxide, 3)3 -4.5233351 1.5161565 2.243857e-01
## 17 ns(sulphates, 4)1 1.0769006 4.2157372 1.147396e-03
## 18 ns(sulphates, 4)2 1.3530764 4.0411745 9.196090e-05
## 19 ns(sulphates, 4)3 0.8300265 8.0500382 1.782228e-02
## 20 ns(sulphates, 4)4 0.2356723 4.6793212 2.108640e-02
## 21 volatile.acidity -4.4350368 -2.4208432 3.030894e-11
## 22 ns(alcohol, 4)1 -1.7647206 0.9364081 5.555788e-01
## 23 ns(alcohol, 4)2 1.3185135 3.9064505 6.660549e-05
## 24 ns(alcohol, 4)3 -2.1715741 4.1959644 5.225071e-01
## 25 ns(alcohol, 4)4 1.1333260 6.4132474 8.354436e-03
Get Cooks Distances
CooksDistance <- cooks.distance(step_with_splines_model)
summary(CooksDistance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.800e-07 5.527e-05 2.129e-04 7.395e-04 6.754e-04 8.246e-02
Count Cooks Distances above the suggested threshold of 1
sum(CooksDistance > 1)
## [1] 0
Count Cooks Distances above the other suggested threshold of 4/length(df)
sum(CooksDistance > 4/length(wine_red))
## [1] 0
Plot of the model fitted values (x-axis) versus the Cooks Distance values (y-axis). Any points lying far above the others should be investigated.
ggplot(
data = data.frame(model_fitted_values = step_with_splines_model$fitted.values,
Cooks_Distance = CooksDistance),
aes(x = model_fitted_values,y = Cooks_Distance)
) + geom_point()
After Adding splines a fair ammount of the parameters have 95% confidence intervals that include 0. This seems to be an artifact of the splines and per spline there is genrally a parameter with a parameter CI that does not include 0
Task: Summarize predictive/discriminatory power of the model with an ROC curve and plots of predicted probabilities.
pred <- prediction(step_with_splines_model$fitted.values, wine_red$quality)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, col=rainbow(10), main="ROC Curve")
auc.tmp <- performance(pred,"auc");
auc <- as.numeric(auc.tmp@y.values)
cat(paste("The AUC for my selected model is:",round(auc,3)))
## The AUC for my selected model is: 0.839
plot(wine_red$fixed.acidity,step_with_splines_model$fitted.values)
plot(wine_red$volatile.acidity,step_with_splines_model$fitted.values)
plot(wine_red$citric.acid,step_with_splines_model$fitted.values)
plot(wine_red$chlorides,step_with_splines_model$fitted.values)
plot(wine_red$free.sulfur.dioxide,step_with_splines_model$fitted.values)
plot(wine_red$total.sulfur.dioxide,step_with_splines_model$fitted.values)
plot(wine_red$sulphates,step_with_splines_model$fitted.values)
plot(wine_red$alcohol,step_with_splines_model$fitted.values)
I built a model that describes the wine quality well enough. The Spline’s added to all but one predictors in the final model were invaluable. The ROC AUC jumped from ~0.17 to ~0.84 due to their addition. Many of the predictors seem to behave with higher order terms which were modeled with splines but it would have been interesting to look at higher order or interaction terms between the variables. There are some other objective predictors I would be interested in adding such as price per bottle, age of the wine, type of grapes used and wine brand information. Sadly all this information is not public with the dataset. I would be interested in getting data about the public’s scores for the wines tested and see if those scores generally track the scores that the experts gave each wine.
A Dataset like this could be used to help predict wines and characteristics of wines that will be successful on the market before any human even tests them. This can help wineries to produce better quality wines that taste better. Also if an individual tasted and scored many of these wines their individual preferences could be encoded into a model that could help them to select similar wines to wines that they have enjoyed before.
Comments on Variables Selected: